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Abstract 

This paper proposes a strategy for regularized estimation in multi-way con- 
tingency tables, which are common in meta-analyses and multi-center clinical 
trials. Our approach is based on data augmentation, and appeals heavily to 
a novel class of Polya-Gamma distributions. Our main contributions are to 
build up the relevant distributional theory and to demonstrate three useful 
features of this data-augmentation scheme. First, it leads to simple EM and 
Gibbs-sampling algorithms for posterior inference, circumventing the need for 
analytic approximations, numerical integration, Metropolis-Hastings, or vari- 
ational methods. Second, it allows modelers much more flexibility when choos- 
ing priors, which have traditionally come from the Dirichlet or logistic-normal 
family. For example, our approach allows users to incorporate Bayesian ana- 
logues of classical penalized-likelihood techniques (e.g. the lasso or bridge) 
in computing regularized estimates for log-odds ratios. Finally, our data- 
augmentation scheme naturally suggests a default strategy for prior selection 
based on the logistic-Z model, which is strongly related to Jeffreys' prior for a 
binomial proportion. To illustrate the method we focus primarily on the par- 
ticular case of a meta-analysis/multi-centcr study (or a J x K x N table). But 
the general approach encompasses many other common situations, of which 
we will provide examples. 



* Poison is Professor of Econometrics and Statistics at tlic Chicago Booth School of Business, 
email: ngp@chicagobooth.edu. Scott is Assistant Professor of Statistics at the University of Texas 
at Austin, email: James.Scott@mccombs.utexas.edu. 



1 Introduction 



In this paper, we fit hierarchical Bayesian models for multi-way contingency tables 
using data augmentation. We focus on J x K x N tables, which are common in 
meta-analyses or multi-center studies. One reason for the relative dearth of practi- 
cal, exact Bayesian approaches to these problems is the nonlinearity (and associated 
nonconjugacy) of their likelihoods. Our data-augmentation approach directly ad- 
dresses this issue. It thereby avoids the need for analytic approximations, numerical 
integration, Metropolis-Hastings, or variational methods. We also describe many 
extensions involving logistic-type models that rely upon the same basic framework. 
These extensions encompass many areas of wide interest in modern statistical prac- 
tice, including mixtures of logits and mixed-membership/topic models. 



Table 1 presents a simple example of a multi-way table, from Skene and Wake- 



field (1990). The data arise from a multi-center trial comparing the efficacy of two 
treatment arms — in this case, different topical cream preparations, labeled the treat- 
ment and the control. This table suggests two advantages of pooling information 
across treatment centers in a hierarchical Bayesian model, both articulated by many 
previous authors: it sharpens the estimate of the overall difference between treat- 
ment and control, and it regularizes those proportions whose maximum-likelihood 
estimates would otherwise be identically zero — for example, in the control group at 
centers 5 and 6, where no successes were observed. 

The goal of this paper is not to propose new statistical models for such tables. 
Rather, we work firmly within the context of existing models, showing: (1) that 
these models have a normal mixture representation involving Polya-Gamma random 
variables; (2) that this mixture representation is of great practical relevance, since 
it leads to new, efficient Gibbs and EM algorithms for posterior computation; and 
(3) that this representation also allows users far greater flexibility in specifying 
computationally tractable priors, including default or "objective" priors. 

Our fundamental contribution is the data-augmentation scheme of Section [2} 
which appeals heavily to a novel class of Polya-Gamma distributions. The associated 
distributional theory depends, in turn, upon the Levy representation of Fisher's Z 
distribution, discussed extensively by Poison and Scott (2011b). This allows us to 
represent logistic likelihoods directly as mixtures of normals. The resulting mixture 
representation is quite parsimonious, in that it involves only one latent variable 
per cell in the table. This compares favorably with other, different latent-variable 
methods that involve one latent variable per observation — for example, the random- 
utility representation of the logit model used by Holmes and Held (2006) in the 
context of logistic regression. 

Our work is closest to that of Leonard (1975), Skene and Wakefield (1990), and 



Forster and Skene (1994) in two respects: we parameterize tables in terms of log-odds 
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Table 1: Data from a multi-center, binary- response study on topical cream effec- 
tiveness (Skene and Wakefield, 1990). 



Treatment Control 



Center 


Success 


Total 


Success 


Total 


1 


11 


36 


10 


37 


2 


16 


20 


22 


32 


3 


14 


19 


7 


19 


4 


2 


16 


1 


17 


5 


6 


17 





12 


6 


1 


11 





10 


7 


1 


5 


1 


9 


8 


4 


6 


6 


7 



ratios; and we use logistic-normal priors, along with their generalizations, within a 



hierarchical Bayesian model. It is also closely related with the work of Gelman et al. 



(2008) and Bedrick et al. (1996), and to a lesser extent that of Gelman (2006), in 
that we propose a new default prior for log-odds ratios in logistic-type models. 
Extensive bibliographies on Bayesian methods for categorical data analysis can 



be found in Agresti and Hitchcock (2005 ), Forster (2010 ), and Chapter 5.7 of Gelman 



et al. (2004). Similar issues of computational tractability arise in ecological infer- 



ence for 2x2 tables (Wakefield, 2004); in other models for meta-analysis (Gray 



1994 Carhn 



1992 



and Hamurkaroglu 



and in Poisson-type models for 2 x 2 x tables (Demirhan 



2008). Other important Bayesian papers on contingency tables 



include Altham (1969) and Crook and Good (1980). While the focus of our analysis 
is not on testing for independence, in the manner of Diaconis and Efron (1985), it 
is also possible to compute Bayes factors using our algorithms. 

The paper is organized as follows. Section [2] presents our main result concerning 
the Polya-Gamma mixture representation for the logit-type likelihoods that arise in 
multi-way tables, beginning with the 2x2 case and working up from there. Sections 
|3]and|4]use this representation to derive simple EM and Gibbs-sampling algorithms, 
respectively, for posterior computation. Section [5] analyzes the data from Table [T] 
to illustrate the proposed method. Section [6] proposes a default logistic-Z prior for 
the models of Section [2| appealing to Jeffreys' arguments about priors for binomial 
proportions. Section [7] considers the general J x K x N case. Section |8] concludes 
with several remarks about generalizations and interesting features of the overall 
approach. Proofs of our main results, along with the distributional theory of Polya- 
Gamma random variables, are deferred to appendices. 
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2 Data augmentation for multi-way tables 



2.1 A single table 



First, consider the situation of a binary-response trial designed to compare an active 
treatment with a control. We will use this simple case to introduce the basic theory 
behind our approach, before generalizing to the case of a multi-center multinomial 
response trial (or a J x K x N table). 

For the treatment group, we observe yi successes among rii subjects, while for 
the control group, we observe 1/2 successes among n2 subjects. Let pi denote the 
underlying success probability for the active treatment, and p2 the success probabil- 
ity for the control. Let p = (^1,^2)'- Clearly the likelihood is a product of binomial 
probability mass functions: L{p) = (pi)^i(l — (j92)^^(l — ]52)"^~^^- 

Let ipi and iIj2 denote the log-odds ratios corresponding to pi and p2, respectively: 



^1 = log 



Pi 



l-pi 



and ip2 = log 



P2 



1 -P2 



We first consider the case where if) is assigned a bivariate normal prior with mean 
/i and CO variance matrix S, both fixed in advance. The posterior distribution given 
data D = {yi, ^i, ?/2, '"■2} is 



pi^ip I D) oc p{tp) p{D I ip) 



f 1 1 (e'^i)^i 
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where we have re- written the likelihood in terms of the log-odds ratios ip = {ipi, 1^2)'- 
This does not factorize easily into an analytically convenient form, and has tradi- 



tionally been analyzed using numerical integration (c.f. Skene and Wakefield, 1990), 



analytic approximations to the likelihood (Carlin, 1992 Gelman et al. , 2004), or 



Metropolis-Hastings (Dobra et al. 2006). 



Our main result is that this posterior distribution, far from being intractable, 
is actually a mixture of bivariate normal distributions. So too are many logit-type 
likelihoods similar in overall structure. This leads to very simple EM and MCMC 
schemes for posterior computation; we will present these algorithms shortly, after 
describing the mixing distribution itself. 

The mixing distribution in this conditionally normal representation is from a new 
class of random variables, which we call the Polya-Gamma class. As their name sug- 
gests, Polya-Gamma distributions are closely related to Polya distributions, or infi- 
nite convolutions of exponentials. In the appendix, we summarize some basic facts 
about the Polya-Gamma distribution, including its density and moment-generating 
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function. (A special case of the density, for example, is a multi-scale mixture of 
inverse-Gaussians, which is an interesting generalization of a standard prior.) Here, 
we simply introduce the class in a manner that lends itself straightforwardly to 
simulation. 

Definition 1. A random variable X ~ VQ{a,c) has a Polya-Gamma distribution 
if 

^ ^ (2) 



^ 27r2 ^ (A;- l/2)2 + c7(47r2) 



where each gk is an independent gamma random variable: gt ~ Ga{a,l). 

We can now state the main result of the paper, which is proven in the appendix. 



Theorem 1. The posterior distribution in ([ijj is a mixture of normals with respect 
to latent variables f2 = diag{(jJi,uj2) ■ This mixture takes the following form. 

Part A: The posterior p{il) \ D) can be expressed hierarchically as 

{^\D,Q) ~ ^{mn,Vn) (3) 

where uj are latent variables with prior distribution 

Ujr^Vgin^,0), (4) 

for j = 1, 2; and where 

v^' = n + E-' 

= {yi-ni/2,y2-n2/2)' . 

Part B: The conditional posterior p{ujj \ ipj,D) arising from the model in (^^(^ 
is also in the Polya-Gamma family: 

{uj I ijjj, D) ~ VQ {rij, ipj) (5) 

for i = 1,2. 



2.2 A series of tables 

Suppose now that similar binary-response trials are conducted in each of different 
treatment centers. Let Uij be the number of patients assigned to regime j in center 
i, and let yij be the corresponding number of successes, for i = 1, . . . , A^ and for 
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j = 1 (active treatment) and j = 2 (control). As for the case of a single table, let 
Pij denote the underlying success probabilities, and ipij the corresponding log-odds 
ratios in favor of success. 

Assuming that the individual terms ip are conditionally independent (given some 
common set of hyperparameters) , the posterior for \E' = {ipij} is 

Suppose that, as before, we assume a bivariate normal prior: ipi = {ipii,ipi2)' ~ 
A^(/i, S). Applying Part A of Theorem [l] to each term in the posterior, we may 
introduce augmentation variables flj = diag(a;ji, a;j2) to arrive at the following con- 
ditional representation: 

{ij^\D,n,) ~ Af{m,,Vn,) (6) 
uJij ~ VQ{nij,0) for j = 1,2, 

where 

a + 

V^n.(/ti + S-V) 
iVii - nil/2, yi2 - ni2/2y . 

Moreover, applying Part B, 

{uij I ipij, D) ~ VQ {riij, ipij) . (7) 

We now use this representation to derive simple EM and Gibbs-sampling algorithms 
for estimating the model parameters. 

3 MAP estimation via EM 

We employ an EM algorithm to estimate the posterior mode for \E', beginning with 
the case where fi and S are pre-specified. 

Let denote our current estimate of the vector of log-odds ratios. In the E 
step, we compute the expected value of the log posterior distribution, given this 
current guess, marginally over the augmentation variables Q. Since the posterior 
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Algorithm 1: EM for 2 x 2 x iV tables 
(normal prior, fixed // and E) 

Begin with an initial guess ^^^\ 

For iteration g — 1,2,... 

E Step: For i = 1 : n and j = 1 : 2, set 

:=^tanh(^J|V2). 

Yij 

M Step: For i — 1 : n set 

for Q!f^ = di&g{u\f ,u\$) and Ki = {yn - 71^1/2,1/^2 - ^1^2/2)'. 
End when the sequence of estimates {^^^\ . . .} has converged. 

Figure 1: An EM algorithm for estimating log-odds ratios in a 2 x 2 x table. 



given Vt is conditionally normal, 
I = E{logp(* I 

= E 1 ^^^^^^ - + ^^^^^^ + - ^(^^ - /^)'^"'(^^ - 



i=l 



where 



= E (a;., I 



All expectations are under the conditional posterior distribution for fi, given the 
current guess ^^9\ This final step is justified because the objective function is 
linear in the w^j's, and because these terms are conditionally independent in the 

posterior, given the ■j/'j/s. 

In the M step, we maximize this as a function of all the ipiS jointly to yield the 
next estimate, Since the ipiS are conditionally independent in the posterior, 

given n, the maximizing value of ■0i can be trivially computed using standard normal 
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theory as 



(5+1) 



(9) 



^-1 



[Ki + s V) 



This is essentially weighted least squares, although it is unusual in the sense that the 
weights also appear as part of what would normally be construed as the dependent 
variable in a regression. An interesting comparison is with the methods for sparse 



Bayes estimation proposed by Poison and Scott (2011c). 



To run the algorithm, it is therefore sufficient to know the conditional expected 
value of the latent precision Uij to plug in to the E Step. The following lemma, 
proven in the appendix, allows this quantity to be computed without difficulty. 

Lemma 2. Suppose X ~ VQ{a,c). Then 



E(X) = - tanh(c/2) 
c 



(8) 



Applying the lemma and Equation ([T]) to the case at hand, it is clear that 



(9) 



tanh('?/'jj/2) . 



We summarize the resulting EM algorithm in Figure [TJ We also describe an 
ECM algorithm (Meng and Rubin 1993) in Figure [2| whereby and S are also 
iteratively updated by conditional maximum likelihood, given the current estimate 
Two obvious hybrid strategyies, omitted from either figure, are: (1) to fix E, 
while still iteratively updating /x; and (2) to further regularize /x and S using a prior 
distribution. 



4 Gibbs sampling 

To explore the joint posterior distribution over the log-odds ratios {ipij}-, we use 
a simple Gibbs-sampling algorithm. The relevant conditional distributions for ipij 
and ujij are read off directly from Equations ^ and ([T]), and need no further elab- 
oration. We draw attention only to one curious property of these Gibbs updates: 
the conditional posterior distribution for uoij does not have an explicit closed-form 
density representation, but it is still very easy to sample from. 

We incorporate uncertainty in (/i, E) via a normal- Wishart hyperprior for /i and 
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Algorithm 2: ECM for 2 x 2 x tables 
(normal prior, unknown fi and E) 

Begin with an initial guess S*-^-*. 

For iteration g = 1,2, . . . 

E Step: For i = 1 : n and j = 1 : 2, set 



CM Step: Update fi, and S in turn 



elf :=^tanh(^Jf/2). 



Update \1/: For i = 1 : n set 

(0.'»' + E-;,)-'(.. + E-><»)) 

for n\'^^ = diag{u'lf,J£) and = {yn - nii/2, yi2 - naji)' . 
Update and S: Let 

,ig) _ 



N 
1=1 

End when the sequence of estimates {^l/^^-*, . . .} has converged. 



Figure 2: An ECM algorithm for estimating log-odds ratios in a 2 x 2 x A^ table 
where hyperparameters are estimated by maximum likelihood. 



A = S ^, leading to joint distributions of the form 



A) oc |A| 2 exp 



-tr(5A) ) 



p{^p,fi,A) (X |A|f exp 1^-^ Yli^' - /^)'^(^' - ■ lAl"^ exp (^-^tr(5A) 

By comparison (Skene and Wakefield, 1990) used the improper uniform prior for 
and a ZW{d, B) prior for E = A~^. Applying standard theory of the conjugate 
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normal-Wishart family, this models leads to conditionals of the form 



N 



(/i I fi, ^, S) ~ N-^ J2 N-^^ 

(s \ ^f,f^)r^iw(^d+N,B + J2{,p' - - f^y 

The prior expectation of S is given by 

E(S) = E(A-i) ^ 



d-3 



The full MCMC is summarized in Figure [3j The only non-standard part of this 
algorithm is the generation of random variables from a Polya-Gamma distribution. 
To do this, we use the representation in (|2]), and truncate the sum of Gamma random 
variables after some large number K. We have found that K = 200 works well in 
practice, and that larger values made no discernible difference to the sum in the 
cases we examined. Clearly this is an important (and easy) part of the sampler to 
check in examining the robustness of inferences. 

The generation of so many Gamma random variables, merely to simulate a sin- 
gle Polya-Gamma random variable, may seem onerous. But this can be done very 
rapidly on modern computers, and is far less time-consuming than it appears. More- 
over, multi-core processing environments are rapidly becoming the norm, and for 
tasks such as this offer a speedup that is essentially linear in the number of cores 
available. 



5 Example: a multi-center study on topical creams 



To illustrate our Gibbs sampler, we analyze the data in Table [T| from a multi- 
center study assessing the effectiveness of topical creams. In contrast to the original 



analysis in Skene and Wakefield (1990), we are able to avoid numerical integration 



by using the Gibbs sampler previously described (Algorithm 3). 

We use a normal-Wishart prior, as described above. Hyperparameters were 



chosen to match Table II from Skene and Wakefield (1990), who parameterize the 



model in terms of the prior expected values for p, aj, and a1, where 



P 



To match their choices, we use the following identity that codifies a relationship 
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Algorithm 3: Exact Gibbs sampling for 2 x 2 x N tables 
(normal prior for ipi, N-IW prior for fj, and E) 

Begin with an initial guess /z*^^), E^^^. 

Update Q: Draw each ix!ij as 

{ujij I ipij, D) ~ VQ {riij, ipij) . 

Update Draw each ipij as 

{iPi\D,ni) ~ Af{mi,Vn,), 

where 

= diag(a;ii,a;ii) 

m = Vh,(Ki + E"V) 

«j = {yn - nil/2, - 71^2/2)' . 
Update jj, and E: Draw 

(E I //) ~ XW U + TV, S + J^iiji - - //)' 



Figure 3: A Gibbs-sampling algorithm for exploring the posterior distribution for 
the log-odds ratios in a 2 x 2 x table. 
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Log-odds ratios in an 8-center binary-response study 



Treatment (95% CI) 
Control (95% CI) 



Treatment Center 



Posterior distribution for (mu1, mu2) 




Odds of success for treatment 



Figure 4: Top: Posterior distributions for the log-odds ratio for each of the 8 centers 
in the topical-cream study from Skene and Wakefield (1990). The vertical lines are 
central 95% posterior credible intervals; the dots are the posterior means; and the 
X's are the maximum-likelihood estimates of the log-odds ratios, with no shrinkage 
among the treatment centers. Note that the MLE is ipi2 = — oo for the control group 
in centers 5 and 6, as no successes were observed. Bottom: draws from the joint 
posterior for /i = {[11,^2)', with the black line indicating the line where the two 
means are equal. 
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between the hyperparameters B and d, and the prior moments for marginal variances 
and the correlation coefficient. If S ~ XyV{d, B), then 



R = (d-'^)\ + + 2 E (p)v/E(af)E K) E{al) + E(p) VWT)W^) ' 

E(aD + E(p)v/E(aDE(a|) E(aD 

In this way we are able to map from pre-specified moments to hyperparameters, 
ending up with d = 4 and 

/ 0.754 0.857 \ 
~ V 0-857 1.480 J ■ 

The results of fitting this model via Gibbs sampling are summarized in Figure 
|4| which compares the posterior distribution of with the maximum-likelihood 
estimator (which allows no pooling of information across the treatment centers). 
Observe the effect of the shrinkage induced by the Bayesian model, particularly in 
the case of centers 5 and 6. Also note that — while no individual center seems to 
produce overwhelming evidence that the treatment improves upon the control — 
the posterior distribution for /x supports the efficacy of the treatment quite strongly. 
This shows the potential of our method for quantifying uncertainty in meta-analyses 
precisely of this kind. 



6 Prior choice 



6.1 Hierarchical Z priors 

Until now we have used a normal prior for each two- vector ipi, one which incorporates 
subjective information via fi and S. For many applications this will involve no 
difficulty. But as a default procedure, it poses two issues. First, the tails of the 
normal prior are thinner than the tails of the logistic likelihood, a situation widely 



known to yield non-robust inferences (e.g. West, 1984). This issue is particularly 



acute in the case of the logit likelihood, which is highly sensitive to large values. 



Second, in the spirit of Gelman et al. (2008), we would like a "default" or weakly 



informative prior for log-odds ratios that lies somewhere between two extremes: fully 
informative priors, and formal noninformative priors (e.g. reference priors). This is 
particularly crucial if one intends to use the framework for model selection, in which 
noninformative priors cannot in general be used. 

In choosing such a default prior, it is crucial to get two things right: the tails. 



and the scale. To this end, Gelman et al. (2008) propose the use of a Student 



t prior in logistic regression, where the likelihood can be well approximated by a 
with a scale parameter of 2.5. We agree with their basic reasoning leading to 
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this choice. Our only modification of their framework is to show how, using our 
data-augmentation scheme, one may conduct exact inference using a logistic-Z prior 
rather than a t prior. This will directly (rather than approximately) match the tails 
of the prior with that of the likelihood. Moreover, it will do so with no extra model 
complexity or computational cost, compared with the Student-t case. 

To show this, we revisit some basic distributional properties of log-odds ratios. 
Let p be a success probability and ip the corresponding log-odds ratio. By definition, 
if) = log (p/(l — p)) with inverse p = e"^ /{l + e"^). The Jacobian is dip /dp = (1— p)^^. 
This leads to the following distributional identity: 

p ~ Be(a, b) implies ip = log ( — - — ) ~ Z (a, b, 1, 0) , 

\l-pj 



where Z denotes Fisher's Z distribution (Fisher, 1921 Barndorff- Nielsen et al. 



1982). For example, Jeffreys' prior for a proportion p is 

TT ^p{l - p) 

which implies that 

1 e^^ 

pW 



vr 1 + e'^ ' 

a Z ^, 1, 0) prior for the log-odds. 

One possible choice of (independent) prior is therefore just a product of Z{aij, b^, 1, 
distributions for the ipt/s, leading to a posterior of the form 



N 



p{^ I D) OC J] {p{^i)p{Di I ^i) p{iP2)p{D2 I ^2)} 



i=l 

( gJ/ilV'il Qyi2'^i2 gdil'/'il QO.i2lpi2 



OC 



oc 



IT (^X + e'^'i)"" (1 + e'^'2)ni2 (X -|- g'/'ii)<iii+bii (^x + e'^")"'2+''i2 

( g(yil+aii)'/'ii g{l/i2+ai2)'/'i2 

11 (^X + e^'l )""+""+**! (1 + e'^'2^ni2+ni2+fei2 



This leads to an obvious modification of the closed-form updates in the EM and 
Gibbs-sampling algorithms already presented, and induces no extra computational 
difficulty. Under this framework, and bij can be interpreted as pseudo-data — 
specifically, the prior number of successes and failures at center i for treatment j. 
A reasonable default choice might be aij = bij = 1/2, following Jeffreys' original 
argument. 
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This choice, however, precludes the possibihty of learning about hyperparam- 
eters, and therefore pooling information across tables. To allow this within the 
context of a default prior specification, we propose a hierarchical Z prior with fixed 



scale, with tails that will match the likelihood. Following Barndorff-Nielsen et al. 



(1982), this can be represented as a variance-mean mixture of bivariate normals 



with respect to a Polya mixing distribution: 

{^i I /i, c^o) ~ M{yjQ\i, wqC) 
Wo ~ Pol(l/2,l/2) 
/X ~ A/'(0,/), 

where C is some fixed correlation matrix (perhaps the identity). This entails only 
minimal changes to the Gibbs-sampling updates for and /x in Algorithm 3. The 
only additional step is the simulation of a single exponentially tilted Polya random 
variable, which can be done using the methods described in [Gramacy and Poison 



(2010). 



6.2 Other shrinkage priors 

Our hierarchical normal representation for the likelihood means that Bayesian ver- 
sions of penalized-likelihood procedures can easily be used to yield regularized esti- 
mates of log-odds. Consider, for example, the model where 

(z, |A,,/x,S) ~ Ar(0,Ai) 

Ai = diag(Aii, Ai2) 

Zi = (V'il,^j2)' 

If, for example, Xij ~ Ex(2), then we have specified a lasso-type prior for ipn, as 
well as for the contrast ipn — ipi2. Many other choices are possible (e.g. horseshoe or 



bridge priors), with Poison and Scott (2011a) providing an extensive bibliography. 



The posterior mode under such a specification can — if warranted by the data — 
collapse to a solution where Zi2 = 0, in which case the treatment and control are 
estimated to be equally effective at treatment center i. 

7 Generalizations 

Now consider a multi-center, multinomial response study with more than two treat- 
ment arms. This can be modeled using hierarchy of different two-way tables, each 
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having the same J treatment regimes and K possible outcomes. The data D con- 
sist of triply indexed outcomes Uijk-, each indicating the number of observations in 
center i and treatment j with outcome k. We let riij = i/ij indicate the number 
of subjects assigned to have treatment j at center k. 

Let P = {pijk} denote the set of probabilities that a subject in center i with 
treatment j experiences outcome k, such that '^f.Pijk = 1 hJ- Given these 

probabilities, the full likelihood is 



N J K 



i(^)=nnnpSi' 



i=i j=i k=i 



Following Leonard (1975), we model these probabilities using a logistic transfor- 
mation. Let 

_ exp{ilJijk) 

Pijk 



We assume an exchangeable matrix- normal prior at the level of treatment centers: 



c) 



where ipi is the matrix whose {j,k) entry is ipijk', M is the mean matrix; and 12 r 



and Ec are row- and column-specific covariance matrices, respectively. See Dawid 



(1981) for further details on matrix-normal theory. Note that, for identifiability, we 
set ipijK = 0, implying that Sc is of dimension K 
This leads to a posterior of the form 



1. 



N 



p{^\D) = -l[ 



1=1 



J K 



pm n n 

i=l k=l 



exp{ipijk) 



Vijk 



E£iexp(V^. 



iji) 



suppressing any dependence on (M, S/j, Sc) for notational ease. 

To show that this fits within the Polya-Gamma framework, we use a similar 



approach to Holmes and Held (2006), rewriting each probability as 



Pijk 



ijk) 



Y^i^k exp(V^ij7) + exp(V^,jfc) 



where Cijk = log{^;_^j!j exp{ipiji)} is implicitly a function of the other V'iji's for / ^ k. 
We now fix values of i and k and examine the conditional posterior distribution 
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for ipi.k = {iJiik, ■ ■ ■,ipijky, given ipi.i for / ^ k: 



i'ijk-Cijk \ y^-i'^ / 1 \ '^ij-Vijk 



J 



]^ _|_ ^'4^ijk ^ijk 



^Vijk i'^ijk ^ijk ) 

This is simply a multivariate version of the same bivariate form in ([T|. Appealing 
to the theory of Polya-Gamma random variables outlined in the appendix, we may 
express this as: 

^ ^^ijk ['^ijk ''ij'fc] 

piifji.k I D, Vi-c-fc)) oc pitpi.k I tpi.(-k)) ■ n — U^. 



j^^cosh'''^ {[ijijk - Cijk]/2) 
J 

Piiji.k I iji.(-k)) ■ n ^e''^^^^^^'""'''^^^ . |^^-'^^Jk[i'^,k-C,,k]V■2'j 



where Uijk ~ VG{nij,0), j = 1, . . . , J; and Kijk = fjijk - nij/2. Given {ujijk} for 
J = 1, . . . , J, all of these terms will combine in a single normal kernel, whose mean 
and covariance structure will depend heavily upon the particular choices of hyperpa- 
rameters in the matrix-normal prior for ipi. Each ujijk term, moreover has conditional 
posterior distribution 

i^^ijk I V'ijfc) ~ VQ{nij,ijjijk - Cijk) , 

leading to a simple MCMC that loops over centers and responses, drawing each 
vector of parameters ipi.k (that is, for all treatments at once) conditional on the 
other V'i-(-fc)'s. 



8 Final remarks 

We have shown that default Bayesian inference for multi-way categorical data can 
be implemented using a data augmentation scheme based on Polya-Gamma distri- 
butions. This leads to simple Gibbs and EM algorithms for posterior computation 
that exploit standard normal linear-model theory. 

It also opens the door for exact Bayesian treatments of many modern-day machine- 
learning classification methods based on mixtures of logits. Indeed, many likelihood 
functions long thought to be intractable resemble the sum-of-exponentials form in 
the multinomial logit model of Section [7) two prominent examples are restricted 
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Boltzmann machines (Salakhutdinov et al. 2007) and logistic-normal topic models 



(Blei and Lafferty, 2007). Applying the Polya-Gamma mixture framework to such 
problems is currently an active area of research. 

A number of technical details of our latent-variable representation are worth 
further comment. First, the dimensionality of the set of latent Wj^'s does not depend 
on the sample size Uij for each cell of the table. Rather, the sample size only affects 
the distribution of these latent variables. Therefore, our EM and MCMC algorithms 
are more parsimonious than traditional approaches that require one latent variable 
for each observation. 

Second, posterior updating via exponential tilting is a quite general situation 
that arises in Bayesian inference incorporating latent variables. For example, the 
posterior distribution of u that arises under normal data with precision u and a 
VQ{a,0) prior is precisely an exponentially titled VQ{a,0) random variable. This 
led to our characterization of the general VQ (a, c) class. 

Notice, moreover, that one may identify the conditional posterior for strictly 
using its moment-generating function, without ever appealing to Bayes' rule for 
density functions. This follows the Levy-penalty framework of [Poison and ScotT 



(2011b) and relates to work by Ciesielski and Taylor (1962), who use a similar 
argument to characterize sojourn times of Brownian motion. It offers the advantage 
of suggesting a simple route for simulating VQ (a, c) random variables, a crucial step 
in our computational results. Doubtless there are many other modeling situations 
where the basic idea is also applicable, or will lead to new insights. 



A Properties of Polya— Gamma random variables 

A.l The case P^(a,0) 

We construct the family of Polya-Gamma random variables as follows. Following 



Devroye (2009), a random variable J has a Jacobi distribution if 



oo 



J ^ A V - (9) 

^2 (A; -1/2)2' 



where the are independent, standard exponential random variables. The moment- 
generating function of this distribution is 

E(e-*-^) - ^ 



cosh(V2t) 

The density of this distribution is expressible as a multi-scale mixture of inverse- 
Gaussians; all moments are finite and expressible in terms of Riemann zeta functions. 
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For details, see Devroye (2009). 



The Jacobi is related to the Polya distribution (Barndorff- Nielsen et al. , 1982), 



in that if J has a Jacobi distribution, and u = J/4, then u ~ Pol(l/2, 1/2). 

Let Uk ~ Pol(l/2, 1/2) for = 1, . . . , n be a set of independent Polya-distributed 
random variables. A VQ{n,0) random variable is then defined by the sum = 
X]fc=i ^k- Its moment generating function follows from that of a Jacobi distribution, 
namely 



E{exp{-Ukt)} 



and E{exp{—u*t)} 



cosh(Vt/2) cosh"(v/t/2) 
The name "Polya-Gamma" arises from the following observation. From ([9]), 

eik 



D 



E(2.^g(fc-l/2)^ 



where ei k are independent exponential random variables. Rearranging terms. 



D i \ ^ 

2^^ 



En 
1=1 e^.fc 

r (k - i/2y 



D 



2^^ 



9k 



^ (A: -1/2)2 



where gt are i.i.d. Gamma(?7,, 1) random variables. More generally we may replace 
n with any positive real a. 



A. 2 The general case 

The general VQ{a^c) class arises through an exponential tilting of the VQ{a,Q) 
density: 

exp ( - yw) Pafi{uj) 
Ea,o |exp(-yu;)| 

where pafiiuj) is the density of a VQ{a, 0) random variable. Using the above results, 
along with Euler's expansion of the cosh function, write the moment-generating 
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function of this distribution as 



cosh" (^^^ 

^ 4{A:-l/2)27r2 



n I 



fc=l ' 4(A:-l/2)27r- 

oo 



/ 1 \ ■ 
+ d^Hy where 4 = 4 f A; - - j 



2 I 2 
TT + C . 



We can therefore write a VQ{a, c) random variable as 

D Ga(a, 1) 1 Ga(a, 1) 



k=l 



4 27r2^(A;-i)2 + cV(47r2)' 

appeahng to the moment-generating function of the gamma distribution. 



B Proofs of main results 



B.l Theorem [I 

With all the distributional theory of the previous section in place, the proofs of our 
main results will proceed very straightforwardly. 

Proof. Use the expressions for the moment-generating function of a Polya-Gamma 
random variable (given above) to write the likelihood in ([T]) as 

(^e^iyi (e^2)j/2 ^ 2""^ exp{fi;i^i} 2-"2 expj/taV^a} 

(TTe^^ ' (TTeV^ ~ cosh"^ (V'i/2) cosh"2(V^2/2) 

= 2-(ni+n2)^Ki^i E{exp(-a;iV'i/2} E{exp(-u;2^2V2} , 

where Uj ~ VQ{nj, 0), j = 1, 2; and where we recall that Kj = yj — nj/2. 
Given particular values of ui and U2, we can write p{ip \ D,fl) as 

p{^p \ D,il) oc exp{Kiipi - ■ exp(/t2^/'2 - 102^^1/2) p{il) \ /i, S) 



oc exp 



' 2 



Since p{il) \ fi, S) is a bivariate normal prior, the posterior is conditionally normal, 
with the specific form given by Part A of the theorem. 
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Turning now to Part B, we observe that the conditional posterior p{ujj \ ipjiD) 



is of the same form as (10), with ^jjj = c. We therefore arrive at the resuh by 
straightforwardly applying the previous section's distributional theory for VQ{a^c) 
random variables. 

□ 

B.2 Lemma H 

Proof. From the moment generating function for uj ~ VQ{a, 0) density evaluated at 
we have 

cosh-'^ (^) =E('e-5--' 



oo 

1, ,^2 



Taking logs and differentiating under the integral sign with respect to c then gives 
the moment identity 

EM = i|-iogco=r(H- 

Simple algebra reduces this down to the form given earlier 

E (w) = - tanh i^- 

□ 
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